{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4Pjmz-RORV8E"
      },
      "source": [
        "# Train a text labeler\n",
        "\n",
        "The [Hugging Face Model Hub](https://huggingface.co/models) has a wide range of models that can handle many tasks. While these models perform well, the best performance often is found when fine-tuning a model with task-specific data. \n",
        "\n",
        "Hugging Face provides a [number of full-featured examples](https://github.com/huggingface/transformers/tree/master/examples) available to assist with training task-specific models. When building models from the command line, these scripts are a great way to get started.\n",
        "\n",
        "txtai provides a training pipeline that can be used to train new models programatically using the Transformers Trainer framework. The training pipeline supports the following:\n",
        "\n",
        "- Building transient models without requiring an output directory\n",
        "- Load training data from Hugging Face datasets, pandas DataFrames and list of dicts\n",
        "- Text sequence classification tasks (single/multi label classification and regression) including all GLUE tasks\n",
        "- All training arguments\n",
        "\n",
        "This notebook shows examples of how to use txtai to train/fine-tune new models."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Dk31rbYjSTYm"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "XMQuuun2R06J"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline-train] datasets pandas"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PNPJ95cdTKSS"
      },
      "source": [
        "# Train a model\n",
        "\n",
        "Let's get right to it! The following example fine-tunes a tiny Bert model with the sst2 dataset.\n",
        "\n",
        "The trainer pipeline is basically a one-liner that fine-tunes any text classification/regression model available (locally and/or from the HF Hub). \n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "USb4JXZHxqTA"
      },
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "from txtai.pipeline import HFTrainer\n",
        "\n",
        "trainer = HFTrainer()\n",
        "\n",
        "# Hugging Face dataset\n",
        "ds = load_dataset(\"glue\", \"sst2\")\n",
        "model, tokenizer = trainer(\"google/bert_uncased_L-2_H-128_A-2\", ds[\"train\"], columns=(\"sentence\", \"label\"))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CubsNAbpEWQg"
      },
      "source": [
        "The default trainer pipeline functionality will not store any logs, checkpoints or models to disk. The trainer can take any of the standard TrainingArguments to enable persistent models.\n",
        "\n",
        "The next section creates a Labels pipeline using the newly built model and runs the model against the sst2 validation set. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "xw2y2C5Mg11_",
        "outputId": "78400e45-ea5c-4cd9-d205-b55ee7a9f005"
      },
      "source": [
        "from txtai.pipeline import Labels\n",
        "\n",
        "labels = Labels((model, tokenizer), dynamic=False)\n",
        "\n",
        "# Determine accuracy on validation set\n",
        "results = [row[\"label\"] == labels(row[\"sentence\"])[0][0] for row in ds[\"validation\"]]\n",
        "sum(results) / len(ds[\"validation\"])"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0.8268348623853211"
            ]
          },
          "metadata": {},
          "execution_count": 10
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZAHSwaB3Ex49"
      },
      "source": [
        "82.68% accuracy - not bad for a tiny Bert model. \n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f3GkY4JNEhhE"
      },
      "source": [
        "# Train a model with Lists\n",
        "\n",
        "As mentioned earlier, the trainer pipeline supports Hugging Face datasets, pandas DataFrames and lists of dicts. The example below trains a model using lists."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QkApw1b2hfZq",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 182
        },
        "outputId": "8c3dceae-49fb-4b63-837d-5944e63c768e"
      },
      "source": [
        "data = [{\"text\": \"This is a test sentence\", \"label\": 0}, {\"text\": \"This is not a test\", \"label\": 1}]\n",
        "\n",
        "model, tokenizer = trainer(\"google/bert_uncased_L-2_H-128_A-2\", data)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Some weights of the model checkpoint at google/bert_uncased_L-2_H-128_A-2 were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']\n",
            "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
            "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
            "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.weight', 'classifier.bias']\n",
            "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "    <div>\n",
              "      \n",
              "      <progress value='3' max='3' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
              "      [3/3 00:00, Epoch 3/3]\n",
              "    </div>\n",
              "    <table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: left;\">\n",
              "      <th>Step</th>\n",
              "      <th>Training Loss</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "  </tbody>\n",
              "</table><p>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cjYTxm7sFKyZ"
      },
      "source": [
        "# Train a model with DataFrames\n",
        "\n",
        "The next section builds a new model using data stored in a pandas DataFrame."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0XaKKQ32wqbs",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 182
        },
        "outputId": "edb82a45-6c2a-4718-ce0b-56030f95ffbf"
      },
      "source": [
        "import pandas as pd\n",
        "\n",
        "df = pd.DataFrame(data)\n",
        "\n",
        "model, tokenizer = trainer(\"google/bert_uncased_L-2_H-128_A-2\", data)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "Some weights of the model checkpoint at google/bert_uncased_L-2_H-128_A-2 were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']\n",
            "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
            "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
            "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.weight', 'classifier.bias']\n",
            "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "    <div>\n",
              "      \n",
              "      <progress value='3' max='3' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
              "      [3/3 00:00, Epoch 3/3]\n",
              "    </div>\n",
              "    <table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: left;\">\n",
              "      <th>Step</th>\n",
              "      <th>Training Loss</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "  </tbody>\n",
              "</table><p>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QH3D8PQSFvQO"
      },
      "source": [
        "# Train a regression model\n",
        "\n",
        "The previous models were classification tasks. The following model trains a sentence similarity model with a regression output per sentence pair between 0 (dissimilar) and 1 (similar)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1rXuz4ncw9G-"
      },
      "source": [
        "ds = load_dataset(\"glue\", \"stsb\")\n",
        "model, tokenizer = trainer(\"google/bert_uncased_L-2_H-128_A-2\", ds[\"train\"], columns=(\"sentence1\", \"sentence2\", \"label\"))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "fyvAslSP6j0F",
        "outputId": "ec46a6aa-25a7-4777-e226-d53aeb37899b"
      },
      "source": [
        "labels = Labels((model, tokenizer), dynamic=False)\n",
        "labels([[(\"Sailing to the arctic\", \"Dogs and cats don't get along\")], \n",
        "        [(\"Walking down the road\", \"Walking down the street\")]])"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[[(0, 0.5648878216743469)], [(0, 0.97544926404953)]]"
            ]
          },
          "metadata": {},
          "execution_count": 14
        }
      ]
    }
  ]
}
